Copy tokenizer files in each of their repo #10624
Conversation
Love it! Maybe a good practice to link to a sample of the related commits on hf.co: for instance, https://huggingface.co/facebook/bart-base/commit/c2469fb7e666a5c5629a161f17c9ef23c85217f7
I think I did around 50 of them in various repos to move all the tokenizer files, so it's a bit hard to keep track of all of them.
Yep, just link one, or a small sample. It makes it easier to see what this PR entails on the hf-hub side.
* Move tokenizer files in each repo
* Fix mBART50 tests
* Fix mBART tests
* Fix Marian tests
* Update templates
What does this PR do?
This PR cleans up the maps in the tokenizer files to make sure each checkpoint has its own tokenization files. This will allow us to remove custom code that mapped some checkpoints to special files (like BART using the RoBERTa vocab files) and to take full advantage of the versioning system for those checkpoints. The tokenizer files for all affected checkpoints have been copied to the corresponding model repos in parallel.
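To make this concrete, here is a minimal sketch of the old pattern (illustrative only: the exact checkpoint names and URLs are assumptions, not the contents of this diff). BART checkpoints had no tokenizer files of their own, so the map redirected them to RoBERTa's files:

```python
# Simplified sketch of the pre-PR situation (hypothetical entries):
# BART checkpoints reused RoBERTa's vocab/merges files, so the map
# had to redirect each checkpoint to another repo on the hub.
PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "facebook/bart-base": "https://huggingface.co/roberta-large/resolve/main/vocab.json",
    },
    "merges_file": {
        "facebook/bart-base": "https://huggingface.co/roberta-large/resolve/main/merges.txt",
    },
}
```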
For instance, to accommodate the move on the fast BART tokenizers, the following commits have been made on the model hub:
In this PR I've also made the structure of the maps uniform across models, to make it easier to alter (and ultimately remove) them in the future via automated scripts.
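For reference, a sketch of the uniform shape after the move (again assumed entries, not a copy of the diff): every checkpoint points at tokenizer files hosted in its own repo, so the redirection logic disappears and the maps become easy to process, and eventually drop, with a script.

```python
# Sketch of the post-PR structure (hypothetical entries): each
# checkpoint resolves its tokenizer files from its own repo, so no
# cross-repo redirection or custom code is needed.
PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "facebook/bart-base": "https://huggingface.co/facebook/bart-base/resolve/main/vocab.json",
    },
    "merges_file": {
        "facebook/bart-base": "https://huggingface.co/facebook/bart-base/resolve/main/merges.txt",
    },
}
```

With the files in each repo, the standard loading path works without any special-casing:

```python
from transformers import AutoTokenizer

# Fetches the tokenizer files directly from the checkpoint's own repo.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
```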